In this part of the workshop we will go through a sample workflow with the CAGEr R package (Haberle et al., 2015). Two samples were chosen at different stages in zebrafish development for this practical: 512 cell stage (~2,7 hours hpf) and Prim 6 (~24 hpf). These stages have been previously published by Nepal et al, 2013 and are also available in one of the R-packages that can be found here.
We will only go through all the standard functions and analyses of CAGEr in this practical. In the following practicals of this workshop we will explore the data further.
CAGEr Analysis
There are various publicly available CAGE-seq data that can be easily accessed and uploaded into R with CAGEr. See the vignette (section 5) for all the publicly available data and ways how to import them in CAGEr including:
FANTOM 5: Human and mouse primary cells, cell lines and tissue
FANTOM 3 and 4: Human and mouse tissues as well as several timecourses
ENCODE: Common human cell lines
Zebrafish Developmental Timecourse: Twelve developmental stages
As we will be using the two stages (512 cells and Prim6) of zebrafish development, let’s start by loading in the R-package for the zebrafish developmental timecourse as well as CAGEr:
# packages
require(ZebrafishDevelopmentalCAGE)
require(CAGEr)
We now can display the samples in this package. First by loading data from the zebrafish samples that will return a dataframe with information about the twelve samples:
# loading data
data(ZebrafishSamples)
head(ZebrafishSamples)
# samples
samples <- as.character(ZebrafishSamples$sample) # to see all the samples
samples
We will pick just two developmental stages: “zf_512cells” and “zf_prim6”. In the vignette (and manual) the way of loading in these data is by the function importPublicData(). We need to specify the source, dataset, group, and samples. The source is “ZebrafishDevelopment” and the rest are found in the dataframe of the ZebrafishSamples: ZebrafishCAGE, development, samplenames. The avoid spelling everything out we’ll use the character vector of samples:
# creating a CAGEset object:
myCAGEset <- importPublicData(source = "ZebrafishDevelopment",
dataset = "ZebrafishCAGE",
group = "development",
sample = samples[c(4,11)] )
What does this look like? Let’s type myCAGEset in R:
myCAGEset
As you can see the S4 object contains slots to store the data as we go along that is grouped in the summary under the sections:
Input data information
CTSS information
Tag cluster (TC) information
Consensus cluster information
Number of consensus clusters
Expression profiling
Promoter shifting
Double check the input data information section for the reference genome and the sample labels!
At this stage we only have Cage TSS (CTSS) information. This is already the form of the data of this package. Only the (actual) tag count for each CTSS for both samples is stored here. The slot of Normalized tpm is (still) empty as is the rest of the object.
If for example bam files are used, you would have to determine these CTSS still (see example code below and CAGEr vignette).
myCAGEset = new("CAGEset", genomeName = "BSgenome.Drerio.UCSC.danRer7",
inputFiles = file.path(..), inputFilesType = "bam",
sampleLabels = c("zf_512cells","zf_prim6")
getCTSS(myCAGEset)
With the following few standard CAGEr functions we will “fill in” all these slots and check some general features of the two samples.
How similar are the two samples? CAGEr has a function to look at the correlation between the samples and plots a png in the working directory: plotCorrelation().
plotCorrelation(myCAGEset, samples = "all", method = "pearson")
To access the correlations displayed in the terminal, just assign the function to corr.m to keep a matrix of correlations:
corr.m <- plotCorrelation(myCAGEset, samples = "all", method = "pearson")
At this point we have the actual tag counts per CTSS. Do we have different library sizes for these samples? To check our library sizes use librarySizes() on our cageset.
# overview library sizes:
librarySizes(myCAGEset)
To make samples comparable, we will need to normalize our data. CAGEr offers both a tags per million normalization and a power-law based normalization. Because many CAGE-seq data follow a power-law distrubtion (Balwierz et al., 2009), we’ll use the power-law based normalization here.
First we need to plot the reverse cumulatives with plotReverseCumulatives(). On the log-log scale this reverse cumulative distribution will manifest as a monotonically decreasing linear function of which the slope (alpha) is determined per sample.
# normalisation first plot to assess alpha
plotReverseCumulatives(myCAGEset, fitInRange = c(5, 1000), onePlot = TRUE)
From this we can see the alpha for each sample. Select the alpha for the normalization step, the normalizeTagCount() function when method = “powerlaw”.
What is the average alpha of the two samples? Is that the “Ref distribution” in the plot?
normalizeTagCount(myCAGEset, method = "powerLaw",fitInRange = c(5, 1000), alpha = 1.20, T = 1*10^6)
Let’s check our object again, what is added and in which slot?
myCAGEset
CTSS’ in close proximity of each other give rise to functionally similar set of transcripts within the same promoter elements. Thus, they are basically larger transcriptional units that correspond to individual promoters.
To capture these, tag clusters (TC) of CTSS’ can be easily defined within CAGEr with the function: clusterCTSS(). Run the code below to see the information in the viewer
?clusterCTSS
We will use a simple distance measurement (method = “distclu”). This means that the true distance in bp between individual CTSS is used. Let’s set this at 20 bp (maxDist = 20), so that individual CTSS can not be more than 20 bp apart to form the TC.
We also set the threshold at 1, which means that each CTSS should have 2 or more tag counts in all the samples prior to clustering so as to exclude low fidelity CTSSs.
Finally, no TC are to be called comprised of a single CTSS if the normalized signal is below 5:
clusterCTSS(object = myCAGEset,
threshold = 1,
thresholdIsTpm = TRUE,
nrPassThreshold = 1,
method = "distclu",
maxDist = 20,
removeSingletons = TRUE,
keepSingletonsAbove = 5)
Keep in mind that these are determined per sample
Have a look again at the myCAGEset! The Tag cluster information slot has now been filled in.
Another feature we can determine with CAGEr is the promoter width. That is in this case, the width of each tag cluster per sample. CAGEr has a function that calculates the TC width based on the cumulative distribution of tag signal per TC. See the figure below.
Generally, the width of a TC (promoter) is set between the quantiles 0.1 and 0.9 to capture 80% of CTSS within the TC. To determine the width we follow these two steps:
cumulativeCTSSdistribution())quantilePositions())cumulativeCTSSdistribution(myCAGEset, clusters = "tagClusters")
quantilePositions(myCAGEset,
clusters = "tagClusters",
qLow = 0.1,
qUp = 0.9)
We can easily plot a histogram of the interquantile widths per sample with CAGEr too:
plotInterquantileWidth(myCAGEset,
clusters = "tagClusters",
tpmThreshold = 3,
qLow = 0.1,
qUp = 0.9)
If you look at these two plots, you can see that on avarage, the 512 cell stage has more narrow TCs than prim6 (which is what we would expect). This is great to see general differences in promoter width between samples.
Until now, we have determined the TC and the width of those per individual sample. However, TCs are often sample-specific. This could mean that they are present in one sample but not in the other. More so, TCs do not coincide perfectly within the same region but overlap each other as well as the possibility of more than one TC in one sample and a less amount of TC in the other but larger in width (we’ll visualize an example of this later today).
When we want to compare transcriptional activity across samples, regions that encompass the TCs across the samples are needed. These are called consensus clusters that are aggregates of TCs from all samples. CAGEr has a function for this too: aggregateTagClusters().
These we will use and extract these later during the workshop. For now we will add these to our myCAGEset. We specify that the maximum distance is a 100 bp between the TCs (with the coordinates of quantile 0.1 and quantile 0.9) across the samples.
aggregateTagClusters(myCAGEset,
tpmThreshold = 5,
qLow = 0.1,
qUp = 0.9,
maxDist = 100)
CAGE signals are essentially measuring transcription starting from CTSS and thus the transcription levels can be assessed. This can be done per CTSSs, TCs, or consensus clusters.
CAGEr offers two methods to cluster gene expression to identify gene expression patterns:
Both require the number of expression clusters to be known a priori.
Here we will perform the SOM algorithm at consensus cluster level for our two samples to identify expression patterns. We set the threshold to at least 10 tpm (normalized) in at least one sample to make sure we have robustly expressed TCs within the consensus clusters.
The function is getExpressionProfiles() and in here we define that we expect in this case 6 clusters (xDim = 3, yDim = 2).
getExpressionProfiles(myCAGEset,
what = "consensusClusters",
tpmThreshold = 10,
nrPassThreshold = 1,
method = "som",
xDim = 3,
yDim = 2)
What happens if you change the tpmThreshold? Or change the xDim and yDim?
These results can be plotted:
plotExpressionProfiles(myCAGEset, what = "consensusClusters")
SOM
Subsequently, a character vector of the expression classes can be extracted from the cageset. This will be the same order as the consensus clusters.
expr.clus = expressionClasses(myCAGEset, what = "consensusClusters")
Promoter shifting is described by Haberle et al, 2014. They’ve shown that the same promoter can be used differently in samples at different times in development. The method from this paper has been implemented in CAGEr. Shifting can be detected between two individual samples. If there are more than two samples, they can be merged per group (if possible) and the two groups are then compared.
The shifting uses, similary to determining promoter width, the cumulative distribution per sample of CAGE signal (here: within the consensus clusters). The difference between the cumulative distributions is calculated as a shifting score (see figure below). Additionally, a Kolmogorov-Smirnov test on the cumulative distributions is performed to give a general assessment of differential TSS usage.